Determining Initial Starting Conditions for Documents Clustering
Authors
Abstract
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional, sparse vectors; a few thousand dimensions is typical. Practical approaches to clustering such document vectors use an iterative procedure (e.g., k-means, EM) that is known to be especially sensitive to initial starting conditions (k and the initial centroids). In this paper, we introduce a hybrid clustering algorithm that determines these initial conditions automatically, depending on the required quality of the obtained clusters. The hybrid algorithm combines the agglomerative hierarchical approach with the k-means approach to provide k disjoint clusters. However, the textual, unstructured nature of documents makes the task considerably more difficult than it is for other data sets. We present the results of an experimental study of the proposed algorithm.
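The abstract does not spell out how the two stages are combined, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes cosine similarity on unit-length document vectors, centroid-linkage agglomerative merging, and a hypothetical `min_similarity` threshold standing in for the "required quality". The names `agglomerative_seeds`, `kmeans`, and `hybrid_cluster` are illustrative. The agglomerative stage stops merging once no pair of clusters is similar enough; the number of surviving clusters becomes k and their centroids seed k-means.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def agglomerative_seeds(docs, min_similarity):
    """Assumed stage 1: repeatedly merge the two clusters whose centroids are
    most similar, stopping when the best pair falls below min_similarity.
    Returns one unit-length centroid per surviving cluster (their count is k)."""
    clusters = [[d] for d in docs]
    while len(clusters) > 1:
        cents = [normalize(mean(c)) for c in clusters]
        best, pair = -1.0, None
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                s = dot(cents[i], cents[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < min_similarity:
            break
        i, j = pair
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [normalize(mean(c)) for c in clusters]


def kmeans(docs, centroids, max_iter=100):
    """Stage 2: k-means with cosine similarity, starting from given centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: dot(d, centroids[c]))
            for d in docs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = normalize(mean(members))
    return labels, centroids


def hybrid_cluster(raw_docs, min_similarity):
    """Derive k and the seeds agglomeratively, then refine with k-means."""
    docs = [normalize(d) for d in raw_docs]
    seeds = agglomerative_seeds(docs, min_similarity)
    return kmeans(docs, seeds)


# Toy term-count vectors: two documents about one topic, three about another.
toy_docs = [[3, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3], [0, 0, 1, 3]]
labels, centroids = hybrid_cluster(toy_docs, min_similarity=0.8)
print(len(centroids), labels)
```

The naive agglomerative stage here recomputes every pairwise similarity on each merge, so it is far more expensive than k-means; running it on a small sample of the collection rather than on every document is one way to keep that cost down, though whether the paper does so is not stated in the abstract.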
Related resources
Automatic generation of initial value k to apply k-means method for text documents clustering
Retrieving relevant text documents on a topic from a large document collection is a challenging task. Different clustering algorithms have been developed to retrieve relevant documents of interest. Hierarchical clustering has a quadratic time complexity of O(n^2) for n text documents. The K-means algorithm has a time complexity of O(n), but it is sensitive to the initial randomly selected cluster centers...
Refining Initial Points for K-Means Clustering
Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distr...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining; it is the unsupervised classification of similar documents into different groups. The most important step...
Initialization of Iterative Refinement Clustering Algorithms
Iterative refinement clustering algorithms (e.g. K-Means, EM) converge to one of numerous local minima. It is known that they are especially sensitive to initial conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition leads to ...
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
Due to the increasing number of XML documents, organizing them effectively in order to retrieve useful information from them is essential. A possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...